Now calculate sensitivity and specificity from the predicted probabilities thresholded at 0.5, and add 95% confidence intervals.
## Per fold: mean probability, proportion called positive at the 0.5
## threshold, and fold size.
best <- dt[, .(mean.pcovid = mean(pcovid),
               threshold.pcovid = mean(pcovid > 0.5),
               n = .N),
           by = c("type", "method", "fold.type", "fold.nm")]
## For historical controls a "positive" call is an error, so flip the
## proportion: threshold.pcovid is then the per-fold specificity.
best[type == "Historical.controls", threshold.pcovid := 1 - threshold.pcovid]
## Normal-approximation standard error (kept for reference; the Wilson
## interval is what we report).
best[, threshold.se := sqrt(threshold.pcovid * (1 - threshold.pcovid) / n)]
## Wilson 95% confidence interval via Hmisc::binconf.
ci <- with(best, binconf(threshold.pcovid * n, n, method = "wilson"))
best[, threshold.lower := ci[, "Lower"]][, threshold.upper := ci[, "Upper"]]
## Average the per-fold estimates over folds.
best <- best[, .(pcovid = mean(threshold.pcovid),
                 lower = mean(threshold.lower),
                 upper = mean(threshold.upper)),
             by = c("type", "method")]
## For controls the thresholded proportion is specificity; for COVID
## cases it is sensitivity.
best[type == "Historical.controls", type := "specificity"]
best[type == "COVID", type := "sensitivity"]
thresholded <- dcast(best, method ~ type,
                     value.var = c("pcovid", "lower", "upper"))
head(thresholded)
fwrite(thresholded[, .(method,
                       pcovid_sensitivity, lower_sensitivity, upper_sensitivity,
                       pcovid_specificity, lower_specificity, upper_specificity)],
       file = file.path(d, "spec-sens+ci.csv"))
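For reference, the Wilson interval returned by `binconf(..., method="wilson")` has a simple closed form. A minimal Python sketch (illustrative counts; not the Hmisc implementation):

```python
import math

def wilson_ci(x, n, z=1.96):
    """Wilson score interval for a binomial proportion x/n (z = 1.96 for ~95%)."""
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# e.g. 149 of 151 cases called positive (illustrative numbers)
lo, hi = wilson_ci(149, 151)
```

Unlike the normal-approximation interval based on `threshold.se`, the Wilson interval stays inside (0, 1) even when the observed proportion is close to 0 or 1.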
We compared a series of machine learning algorithms according to their ability to correctly infer the infection status of PCR+ COVID donors and historical blood donor controls, using 10-fold cross-validation. The total sample counts were
## type
## COVID Historical.controls
## 151 1801
with some controls run in duplicate, so that the unique sample counts were
## type
## COVID Historical.controls
## 151 608
We considered three strategies for cross-validation: stratified folds, randomly sampled folds, and folds with a deliberately unbalanced case:control ratio (described in detail in the cross-validation section below).
We sought a method with performance that was consistently good across all cross-validation sampling schemes, because the true proportion of cases in the test data is unknown, and we want a method that is not overly sensitive to the proportion of cases in the training data. We chose to assess performance using sensitivity and specificity, as well as consistency.
Training models on ELISAs for both proteins simultaneously (RBD and SPIKE), we found all methods worked well, with sensitivity >98% and specificity >99.6%. On this metric, LDA gave the highest specificity. Logistic regression had similarly high specificity on some folds of these training data, but with higher sensitivity. However, logistic regression showed the lowest consistency, which reflects that the proportion of cases in a sample directly informs a logistic model's estimated parameters. SVM methods had lower specificity than LDA in the training data, but higher sensitivity. We therefore also created ensemble learners, defined as an unweighted average of SVM (linear) or SVM2 (quadratic) and LDA, to balance the benefits of each approach. The standard methods, which call positives at a fixed number of SD above the control mean, displayed two extreme behaviours: 3-SD had the highest sensitivity (100%), while 6-SD had the highest specificity and the lowest sensitivity, emphasising that the number of SD above the mean is a key parameter, but one which is typically not learnt in any formal data-driven manner.
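The fixed-SD rules are simple enough to state in a few lines. A Python sketch with made-up negative-control ODs (the number of SDs, `k`, is the un-learnt parameter referred to above):

```python
import statistics

def sd_cutoff(control_ods, k):
    """Call a sample positive if its OD exceeds mean(controls) + k * SD(controls)."""
    return statistics.mean(control_ods) + k * statistics.stdev(control_ods)

controls = [0.08, 0.10, 0.09, 0.11, 0.12, 0.10, 0.09, 0.11]  # made-up control ODs
cut3 = sd_cutoff(controls, 3)  # permissive: higher sensitivity
cut6 = sd_cutoff(controls, 6)  # conservative: higher specificity, lower sensitivity
```

Raising `k` moves the cut-off further from the control mean, trading sensitivity for specificity; nothing in the rule adapts to the cases at all.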
Given the overall good performance of all learners, we considered the prediction surface associated with SVM, LDA, SVM-LDA ensemble, and the standard 3-SD, 6-SD hard decision boundaries. Note that while methods trained on both proteins can draw decision contours at any angle, SD methods are limited to vertical or horizontal lines. We can see that success, or failure, of the SD cut-offs depends on how many positive and negative cases overlap for a given measure (SPIKE or RBD) in the training sample. In the training data the two classes are nearly linearly separable when each protein is considered on its own (top panels), which explains good performance of 3-SD and 6-SD thresholds. However, the test data (bottom panels) contain many more points in the mid-range of SPIKE-RBD, which makes hard cut-offs a problematic choice for classifying test samples.
Both SVM and LDA offer linear classification boundaries, but we can see that the probability transition from negative to positive cases is much sharper for LDA, potentially resulting in false negatives when applied to the test data, but giving the model high specificity in the training data under cross-validation. SVM exhibits a softer probability transition around its classification boundary, offering a much more nuanced treatment of the points lying in the mid-range of the two proteins. SVM2 (quadratic SVM) creates a nonlinear boundary, but cross-validation suggested that this did not improve performance relative to linear SVM. Finally, the ensemble learners appeared to combine the benefits of their parent methods. The test data points in the lower-right region of each plot are the hardest to classify due to the relative scarcity of observations in this region in the test dataset. The ENS learner appropriately shows the greatest uncertainty in this region. We chose the ensemble SVM-LDA method to analyse the test data.
This is a close-up of the decision boundary for the final ensemble and its parent methods.
Each sample OD was standardised by dividing by the mean OD of no-sample controls on the same plate (xx samples) or on other plates run on the same day (xx samples). This resulted in more similar distributions for 2019 blood donor samples that were run in duplicate alongside 2020 blood donors and pregnant volunteers, as well as smaller coefficients of variation among PCR+ COVID patients for both SPIKE and RBD.
Logistic regression and linear discriminant analysis (LDA) both model the log odds of a sample being a case as a linear equation, resulting in a linear decision boundary. The two methods differ in how the coefficients of the linear model are estimated from the data. When applied to new data, the output of both logistic regression and LDA is the probability that each new sample is a case.
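Once the coefficients are fixed, both models score a new sample in the same way. A Python sketch (the coefficient values are made up for illustration, not fitted to these data):

```python
import math

def p_case(spike, rbd, b0=-8.0, b_spike=3.0, b_rbd=3.0):
    """Linear log-odds model: probability that a sample is a case.
    The p = 0.5 decision boundary is the straight line
    b0 + b_spike * SPIKE + b_rbd * RBD = 0."""
    log_odds = b0 + b_spike * spike + b_rbd * rbd
    return 1 / (1 + math.exp(-log_odds))
```

Logistic regression and LDA would estimate `b0`, `b_spike`, `b_rbd` differently, but both feed the same linear score through the same sigmoid.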
The support vector machine (SVM) is an altogether different approach. We opted for a linear kernel, once again resulting in a linear boundary. SVM constructs a boundary that maximally separates the classes (i.e. the margin between the closest member of any class and the boundary is as wide as possible); hence points lying far away from their respective class boundaries do not play an important role in shaping it. SVM thus puts more weight on the points closest to the class boundary, which in our case is far from clear-cut. Linear SVM has one tuning parameter, the cost C, with larger values resulting in narrower margins. We tuned C over the values (0.001, 0.01, 0.5, 1, 2, 5, 10) via an internal 5-fold CV with 5 repeats (the winning value was used for the final model in each main CV iteration). We also note that the natural output of SVM is a class label rather than a class probability, so the latter is obtained via the method of Platt [1].
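Platt's method fits a sigmoid that maps SVM decision values to probabilities. A toy Python sketch of the idea, fitting the sigmoid by plain gradient ascent (Platt's actual algorithm uses smoothed targets and a more robust optimiser; the decision values below are invented):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def platt_fit(scores, labels, lr=0.5, iters=5000):
    """Toy maximum-likelihood fit of P(y=1 | f) = sigmoid(a*f + b), mapping
    SVM decision values f to class probabilities."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            err = y - sigmoid(a * f + b)  # residual on the probability scale
            grad_a += err * f
            grad_b += err
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# invented SVM decision values (signed distances to the boundary) and labels
scores = [-2.0, -1.5, -1.0, -0.2, 0.3, 1.0, 1.6, 2.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = platt_fit(scores, labels)
```

Samples far on the positive side of the boundary then get probabilities near 1, and points near the boundary get probabilities near 0.5, which is exactly the soft transition described above.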
The SVM-LDA ensemble was defined as an unweighted average of the probabilities generated under the SVM and LDA models.
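In code the ensemble is a one-liner; a sketch with hypothetical per-sample probabilities from the two parent models:

```python
def ensemble_prob(p_svm, p_lda):
    """Unweighted SVM-LDA ensemble: the mean of the two models' case probabilities."""
    return [(p1 + p2) / 2 for p1, p2 in zip(p_svm, p_lda)]

# hypothetical probabilities for three samples
p_ens = ensemble_prob([0.9, 0.2, 0.55], [0.7, 0.4, 0.45])
```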
N-fold cross-validation (CV) is a statistical procedure for evaluating the predictive performance of a model when the dataset is too small to set aside a designated test subset. In CV the data are divided randomly into N subsets of equal size, called folds (typically N = 5 or 10). At each step i = 1, …, N we train the model on the data excluding fold i, and test the trained model on fold i. The resulting vector of cross-validated predictions has the same length as the vector of outcomes y, and can be used to calculate the predictive accuracy of the model.
We opted for 10-fold CV and generated 9 sets of 10 random folds: 3 sets of stratified folds, in which each fold preserves the overall ratio of cases to controls; 3 sets of randomly sampled folds; and 3 sets of folds in which the ratio of cases to controls was deliberately unbalanced (reflecting a more realistic situation). Accuracy measures were calculated by averaging over these 9 fold sets.
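The fold-generation schemes differ only in how sample indices are allocated to folds. A Python sketch of the first two, using the unique sample counts from above (the unbalanced scheme would instead skew the per-fold case:control ratio on purpose):

```python
import random

def random_folds(n, k, seed=0):
    """Allocate n sample indices to k folds completely at random."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stratified_folds(labels, k, seed=0):
    """Allocate samples to k folds so each fold keeps the overall case:control ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):  # deal class members round-robin
            folds[j % k].append(i)
    return folds

labels = ["COVID"] * 151 + ["Historical.controls"] * 608  # unique sample counts
folds = stratified_folds(labels, 10)
```

With stratification every fold contains 15-16 of the 151 cases, whereas purely random folds can drift noticeably from the overall case proportion.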
[1] Platt, John. "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in Large Margin Classifiers 10.3 (1999): 61-74.
[2] https://www.tandfonline.com/doi/abs/10.1080/00031305.2018.1473796